Abstract. Ranking is one of the most fundamental problems in machine
learning, with applications in many branches of computer science such as
information retrieval systems, recommendation systems, machine translation
and computational biology. Ranking objects based on possibly conflicting
preferences is a central problem in voting research and social choice
theory. In this paper we present a new, simple combinatorial ranking
algorithm adapted to the preference-based setting. We apply this new
algorithm to the well-known scenario where the edges of the preference
tournament are determined by the majority-voting model. It outperforms
existing methods when it cannot be assumed that there exists a global
ranking of good enough quality, and it applies combinatorial techniques
that have not been used in the ranking context before. Our experiments show
the superiority of the new algorithm over existing methods, including those
that were designed to handle heavily perturbed statistics. By combining our
techniques with those presented in [1], we obtain a purely combinatorial
algorithm that correctly answers most of the queries in the heterogeneous
scenario, where the preference tournament is only locally of good quality
but is not necessarily pseudotransitive. As a byproduct of our methods, we
obtain an algorithm solving the clustering problem for the directed planted
partition model. To the best of our knowledge, it is the first purely
combinatorial algorithm tackling this problem.


1 Introduction 

1.1 Background 

The problem of ranking arises in many important applications of computer
science such as information retrieval systems (e.g. the design of modern search
engines), recommendation systems, computational biology and many more. There
are two main approaches to the ranking problem. In the score-based setting the
input is a sample of pairwise preferences from the dataset. The goal is to learn
the so-called scoring function $f : U \to \mathbb{R}$ inducing a linear ordering
on the set of all the objects $U$. Several algorithms have been proposed here.
This setting was considered for example in [5] and [5]. In [3] an SVM-based
ranking algorithm for this scenario was presented. Other algorithms include
PRank given by [S] and [B]. In this paper, however, we focus on the
preference-based setting. In this setting what is given is a preference function
$h : U \times U \to \mathbb{R}$ taking values from the interval $[0,1]$. For a
pair $(x,y) \in U \times U$, the closer $h(x,y)$ is to 0, the more confident we
are that $x$ is "better" than $y$, and vice versa. Therefore the values of $h$
may be interpreted as probabilities. Notice that such a function induces a
directed graph (digraph) with weighted edges, where the weights of edges are
taken from the interval $[0,1]$. The goal is to find high-quality consistent
rankings from such pairwise observations. From now on we call the aforementioned
graph a preference graph of $h$ or simply a preference graph. When this directed
graph is a tournament (i.e. all the edges are defined), as will be the case in
our setting, we call this graph a preference tournament. This approach to ranking
was introduced in [?] and led to several interesting results ([?]). A somewhat
similar model was also considered in [?]. Notice that $h$ does not need to
induce a linear ordering. In particular, the preference graph may not be a dag
(i.e. it may contain directed cycles). In the tournament setting this means that
a preference tournament does not have to be transitive. This is motivated by
real data. The collection of pairwise preferences from which the preference graph
is constructed may be aggregated from several noisy sources and, therefore, some
preferences may give rise to inconsistencies or contradictions. For instance, the
pairwise preferences taken in aggregate may not induce a consistent ranking over
all the objects. Possibly conflicting preferences give rise to many directed
cycles in the preference graph. As a result, the preference graph itself may be
very far from being a dag. This implies that there may not exist a global
good-quality scoring function.

There are several results proposing rankings of objects in the setting where
the preference tournament is not consistent but the notion of a global ranking
of good quality makes sense (the so-called pseudotransitive setting). For
definiteness let us assume for now that the preference graph under consideration
is unweighted, i.e. all existing edges have weight 1. In this scenario the goal
is usually to find an ordering of the vertices of the preference graph that
induces as few backward edges as possible. Investigating all possible
permutations of the set of vertices of the preference graph is usually
intractable (when the set of objects to rank is very large, as it will be in our
scenario). The problem of finding the permutation of vertices of a given digraph
that minimizes the size of the set of backward edges, which in the literature is
called the feedback arc set problem, is NP-hard. However, there exist several
approximation algorithms that output orderings with not too many more backward
edges than the optimum (see for example: [II]). A significant breakthrough was
made in [12], where a simple randomized 3-approximation algorithm for the
feedback arc set problem working in $O(n \log n)$ time was given, where $n$ is
the number of vertices of a given tournament. The novel and counterintuitive idea
was to use a quick-sort approach with pivot points chosen at random for an input
graph that does not necessarily have a linear ordering of vertices. All those
results can be generalized to the weighted setting. In that case the reasonable
objective function to work with is the sum of weights of backward edges. This
variation, as mentioned earlier, models the scenario where the different pairwise
preferences express heterogeneous certainty levels or heterogeneous importance.
This setting is known as the weighted feedback arc set problem. Many formal
results regarding this problem were proved by [18] and [14]. Such a problem was
also considered in [1], where it was shown how to extend the quick-sort approach
to general weighted preference tournaments with weights taken from the interval
$[0,1]$.
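The quick-sort idea and the backward-edge objective described above can be illustrated with a short sketch. This is only a minimal illustration under stated assumptions, not the implementation from the cited works: the tournament is represented as a set `beats` of ordered pairs `(u, v)` meaning "edge from u to v", and the names `pivot_quicksort` and `count_backward_edges` are hypothetical.

```python
import random


def count_backward_edges(order, beats):
    """Count edges (u, v) of the tournament whose tail u is placed after v."""
    pos = {v: i for i, v in enumerate(order)}
    return sum(1 for (u, v) in beats if pos[u] > pos[v])


def pivot_quicksort(vertices, beats, rng):
    """Order vertices by recursive random pivoting (assumed representation).

    Vertices with an edge into the pivot go to its left, vertices the pivot
    points to go to its right. No transitivity is assumed.
    """
    if len(vertices) <= 1:
        return list(vertices)
    pivot = rng.choice(vertices)
    left = [v for v in vertices if v != pivot and (v, pivot) in beats]
    right = [v for v in vertices if v != pivot and (pivot, v) in beats]
    return (pivot_quicksort(left, beats, rng)
            + [pivot]
            + pivot_quicksort(right, beats, rng))


if __name__ == "__main__":
    rng = random.Random(0)
    # A directed 3-cycle: every ordering has at least one backward edge.
    cycle = {(0, 1), (1, 2), (2, 0)}
    order = pivot_quicksort([0, 1, 2], cycle, rng)
    print(order, count_backward_edges(order, cycle))
```

On a transitive tournament the procedure recovers the unique consistent ordering with zero backward edges; on a non-transitive input it still returns some ordering, whose quality is measured by the backward-edge count.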

1.2 Our contribution - strongly heterogeneous setting 

Our results should be viewed as a further extension of the purely combinatorial
approach from [1] to the setting where optimizing the size/weight of the set
of backward edges is not the right thing to do and thus the methods discussed
before fail. As we have already noticed, the statistics that are given as an
input to the ranking algorithm may be heavily perturbed. This makes learning the
global ranking very difficult, if not impossible, in practice. All methods
discussed so far may suffer from significant inconsistencies and noise added to
the input data. If there does not exist a global ranking of good quality (i.e. if
the assumption that a preference tournament is pseudotransitive is not
legitimate), the need arises to find local good-quality rankings. Thus every
ranking algorithm first needs to cluster the preference tournament into locally
pseudotransitive chunks (i.e. chunks that can be made transitive after reversing
only a few directed edges) and then run a ranking algorithm separately on each
chunk. The clustering becomes a necessary preprocessing step.

We give in this paper the first purely combinatorial clustering algorithm in
the directed setting that partitions preference tournaments into a small number
of pseudotransitive clusters. We combine it with the existing ranking methods
to obtain a new effective framework for ranking with heavily perturbed preference
tournaments. We also conduct an extensive evaluation of this clustering+ranking
paradigm by comparing our approach with several state-of-the-art techniques,
including those that focus on the setting with heavily perturbed statistics.

Our results can be applied in many different ways. One natural application
concerns the majority-voting model, which is widely used to obtain the preference
tournament. In this setting users vote to determine which object of a pair
should get the higher rank, and the majority decides. Different pairs of objects
attract different sets of users and the number of votes reflects the demand for
the right evaluation of the given pair. The heterogeneity here may be implied by
the fact that it does not make sense to compare objects belonging to different
categories/domains (such as favourite cars with favourite movies) or simply that
there is not enough data to precisely compare objects from different categories.
Those categories, however, do not always have to be obvious in advance and may
depend on the characteristics of the users. Thus any algorithm that aims to rank
in this scenario also needs to learn the categories with good precision, since
only ranking within a given category is meaningful. The algorithm should not
assume that a domain is known even for a single data point. The exact number of
groundtruth domains as well as their sizes (that may differ) are not necessarily
known in advance.

After learning from the preference tournament, the ranking engine receives a
stream of queries from the users and needs to answer them correctly. Each query
is taken from the same distribution that was used to construct the preference
tournament and is of the form $\{u_1, u_2\}$, where $u_1, u_2 \in U$ are taken
from the universe of all the objects. The answer indicates which object has the
higher rank. We show that our algorithm may be easily applied in this setting to
answer most of the queries correctly while the other approaches fail. The
preference tournament model arising here is an example of the more general
planted partition model of the preference tournament. This general model is the
subject of our theoretical analysis. The planted partition model (which gained
attention because of its applications in many fields of applied computer science)
was extensively studied in the context of clustering undirected graphs (see the
section below), but not many results regarding the directed setting are known.

1.3 Related work - heterogeneous setting and noisy statistics

The most straightforward way to analyze the heterogeneous setting described
above is the planted partition model. The planted partition model is usually
considered in terms of undirected graphs but there is an analogous directed
formulation. Several algorithms to reconstruct the groundtruth clustering that
was used to obtain the planted partition model were considered. Many of them use
spectral partitioning techniques. Some of the most notable approaches are those
of [?], where perturbation theory techniques from [16] were applied, as well as
the results of [?]. Those results, however, mainly consider the undirected
setting where the domains induce dense graphs and there are not too many edges
between different domains. Much less research has been done in the directed
setting. In [TH] clustering with the idea of weighted cuts was considered. It
has to be emphasized that all the papers touching the problem of clustering
directed networks (see also: [?], [?]) have a very different goal than our
clustering algorithm. In all these approaches a strongly connected component is
considered to be a good cluster. This does not make sense in our setting, where
the entire preference tournament is with high probability strongly connected and
clusters are in fact related to subtournaments that are very far from being
strongly connected. There were other papers discussing learning how to rank in
the noisy setting such as [?], where the noisy decision tree is the subject of
analysis, or [22], where noisy comparisons between pairs of strategies are
performed. Both settings are substantially different from ours. In particular,
neither of them solves the clustering problem for directed graphs that is
unavoidable in our scenario. Some of the most effective methods to rank, also
with preference tournaments and for heavily perturbed statistics, are presented
in [53] and [53]. Those methods will be compared with our approach in the
experimental section (see: Appendix).

This work is organized as follows: 

— In Section 2 we formally define the heavily perturbed statistics setting as 
a directed planted partition model. We describe the problem that needs to 
be solved by the ranking algorithm in this setting and the majority-voting 
model as its very special case. 

— In Section 3 we present our ranking and clustering algorithms. 
— In Section 4 we present all the theoretical results.

— In Section 5 we give final conclusions and discuss future work.

— In the Appendix we give all the proofs, show experimental results, explain
why our techniques may also be easily applied in the weighted setting and,
finally, comment more on the algorithms and the ineffectiveness of the
previous methods.

2 The model 

2.1 Planted partition model for heavily perturbed statistics

Assume that we are given a tournament $T$ with the set of vertices
$V(T) = \Omega$, where $\Omega = \Omega_1 \cup \ldots \cup \Omega_k$. We call
each $\Omega_i$ a domain. We denote $n_i = |\Omega_i|$ for $i = 1, \ldots, k$.
Every set $\Omega_i$ comes with a preferred ordering of its vertices that from
now on will be called the canonical ordering of $\Omega_i$ and will be denoted
by $\theta_i$. The directions of the edges of $T$ are chosen independently
according to the following procedure. For $u_1, u_2 \in \Omega_i$ a directed edge
$(u_2, u_1)$ is chosen with probability $p_i$ ($p_i \ll 1$) if $u_1$ appears
earlier than $u_2$ in $\theta_i$ and with probability $1 - p_i$ otherwise.
For $u_1 \in \Omega_i$, $u_2 \in \Omega_j$ ($i \neq j$) a directed edge
$(u_1, u_2)$ is chosen with probability $p_{i,j}$ and a directed edge
$(u_2, u_1)$ is chosen with probability $1 - p_{i,j}$ ($p_{i,j} \gg 0$).
The publicly available parameters of the model are:

— the upper bound $p_u$ on each $p_i$,

— the lower bound $p_m$ on each $p_{i,j}$ and,

— the upper bound $k_u$ on the number of domains $k$.

We call the ratio $p_m / p_u$ the heterogeneity level of the preference
tournament $T$ and denote it by $het(T)$.

Let us comment on this planted partition model for the preference tournament.
We will call the sets $\Omega_i$ groundtruth domains. The canonical ordering
models the fact that within each domain there exists a good-quality ranking that
with very high probability induces only a few backward edges (assumption:
$p_i \ll 1$). The fact that the statistics regarding objects from different
domains are inconsistent (and generally of much weaker quality) is modeled by
the fact that there exists a nontrivial lower bound $p_m$ on each $p_{i,j}$. Of
course in the planted partition model we assume that $p_m > p_u$, i.e.
$het(T) > 1$. The larger the value of $het(T)$ is, the more heterogeneous the
setting is, with the quality of statistics differing significantly between pairs.
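To make the model concrete, the following sketch samples a tournament from a simplified version of it. It is only an illustration under loud assumptions: a single flip probability `p_within` is used for every domain (playing the role of the $p_i$), a single probability `p_between` is used for every cross-domain pair, the canonical ordering of each domain is simply the index order, and the function names are hypothetical.

```python
import random


def sample_planted_tournament(domain_sizes, p_within, p_between, rng):
    """Sample a tournament from a simplified planted partition model.

    Returns (beats, domain_of): beats is a set of directed edges (u, v) and
    domain_of[v] is the groundtruth domain of vertex v.
    """
    domain_of = []
    for d, size in enumerate(domain_sizes):
        domain_of.extend([d] * size)
    n = len(domain_of)
    beats = set()
    for u in range(n):
        for v in range(u + 1, n):
            if domain_of[u] == domain_of[v]:
                # u precedes v canonically; the edge is reversed w.p. p_within.
                forward = not (rng.random() < p_within)
            else:
                # Cross-domain edge (u, v) is chosen with probability p_between.
                forward = rng.random() < p_between
            beats.add((u, v) if forward else (v, u))
    return beats, domain_of


def heterogeneity_level(p_m, p_u):
    """het(T) = p_m / p_u for the public bounds p_m and p_u."""
    return p_m / p_u


if __name__ == "__main__":
    beats, domain_of = sample_planted_tournament(
        [4, 4], p_within=0.05, p_between=0.5, rng=random.Random(1))
    print(len(beats), heterogeneity_level(0.3, 0.05))
```

With `p_between` close to 1/2 the cross-domain edges look like those of a random tournament, while edges inside a domain mostly follow the canonical ordering.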

The objective of the ranking algorithm in this setting is to preprocess the data
to get a good approximation of the groundtruth clustering and then to learn
within each reconstructed cluster. Our novel contribution regards the
preprocessing phase. Most known algorithms operating on the planted partition
model need exact knowledge of the parameters of the model. In our algorithms we
will just need some nontrivial bounds $p_m, p_u$.

We say that a set $X$ is $(1 - \epsilon)$-pure if all but at most a fraction
$\epsilon$ of all the points from $X$ are from the same groundtruth domain. We
say that a set $\mathcal{P}$ of sets of vertices is $(1 - \epsilon)$-pure if
every member of $\mathcal{P}$ is $(1 - \epsilon)$-pure. The goal is thus to find
a $(1 - \epsilon)$-pure partitioning $\mathcal{P}$ of most of the vertices of $T$
with not too many parts and for small enough $\epsilon$ (since one can always
output as a 1-pure partitioning a set of singletons).
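The purity notion defined above is easy to check when the groundtruth domains are known, for instance in a simulation. The sketch below is a minimal illustration; the names `purity` and `is_pure_partition` and the list `domain_of` (mapping a vertex to its groundtruth domain) are hypothetical helpers, not part of the paper.

```python
from collections import Counter


def purity(cluster, domain_of):
    """Fraction of the cluster that belongs to its most frequent domain."""
    counts = Counter(domain_of[v] for v in cluster)
    return max(counts.values()) / len(cluster)


def is_pure_partition(parts, domain_of, eps):
    """Check that every non-empty part is (1 - eps)-pure.

    A part is (1 - eps)-pure if at most an eps-fraction of its points lie
    outside its dominant groundtruth domain.
    """
    for part in parts:
        if not part:
            continue
        counts = Counter(domain_of[v] for v in part)
        minority = len(part) - max(counts.values())
        if minority > eps * len(part):
            return False
    return True


if __name__ == "__main__":
    domain_of = [0, 0, 0, 1, 1]
    print(purity([0, 1, 2, 3], domain_of))
    print(is_pure_partition([[0, 1, 2, 3], [4]], domain_of, eps=0.25))
```

As the text notes, a partition into singletons is trivially 1-pure, which is why the number of parts also matters.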


2.2 The majority-voting model 

This model is an important practical application of our algorithm and a
motivation for the planted partition model of the preference tournament. It is
one of the most popular ways to construct the preference tournament. We formally
define it now.

Let $U = U_1 \cup \ldots \cup U_k$ be the universe of all the objects,
partitioned into $k$ domains $U_1, \ldots, U_k$. Assume that there exists a
global groundtruth ranking of all the objects. The preference tournament is
constructed simply by collecting statistics regarding every unordered pair of
different points from $U$ (this is the training set) and choosing for each pair
the preference that was given by the majority of the users. Different pairs may
be ranked by different users; in particular, the sizes of the sets of statistics
will vary from pair to pair.

Let $\mathcal{P}_U$ be the probability distribution on the set of all unordered
pairs of different points from $U$. It defines the probability that a specific
pair $\{x,y\}$ will be evaluated by the next user (in the training phase) or
will be requested by the next user to be evaluated (in the test phase). The users
choose pairs of points to evaluate/ask for evaluation independently. Each
training point consists of an unordered pair of objects for the evaluation and
the evaluation itself. Objects within a domain are compared much more frequently
than objects from different domains. We say that a training set $\mathcal{T}$ is
$(M, m)$-unbalanced with respect to the partitioning $\{U_1, \ldots, U_k\}$ if
every pair of different points from the same $U_i$ was evaluated at least $M$
times in the training phase and every pair of points from different $U_i, U_j$
was evaluated at most $m$ times in the training phase. The bigger $M$ and the
smaller $m$, the more heterogeneous the setting we consider. Given a pair of
objects, a single user in the training phase gives a correct comparison (i.e.
consistent with the groundtruth ordering) with probability
$p_{succ} > \frac{1}{2}$. The objective is to come up with an algorithm that
gives correct answers to as many queries from the test set as possible.
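The construction of the preference tournament by majority voting can be simulated in a few lines. This is a sketch under stated assumptions rather than the paper's procedure: `rank` maps each object to its position in the groundtruth ranking, `votes_per_pair` gives the number of votes collected for each unordered pair (so an (M, m)-unbalanced training set corresponds to at least M votes inside a domain and at most m across domains), and ties are broken by a fair coin, which the text does not specify.

```python
import random
from itertools import combinations


def majority_vote_tournament(rank, votes_per_pair, p_succ, rng):
    """Build a preference tournament from simulated independent votes.

    rank: dict object -> position in the groundtruth ranking (smaller = better)
    votes_per_pair: dict frozenset({x, y}) -> number of votes for that pair
    p_succ: probability that a single vote agrees with the groundtruth
    Returns a set of directed edges (winner, loser).
    """
    beats = set()
    for x, y in combinations(sorted(rank), 2):
        better, worse = (x, y) if rank[x] < rank[y] else (y, x)
        n_votes = votes_per_pair[frozenset((x, y))]
        correct = sum(rng.random() < p_succ for _ in range(n_votes))
        wrong = n_votes - correct
        # Assumption: a tie (including zero votes) is decided by a fair coin.
        if correct > wrong or (correct == wrong and rng.random() < 0.5):
            beats.add((better, worse))
        else:
            beats.add((worse, better))
    return beats


if __name__ == "__main__":
    rank = {"a": 0, "b": 1, "c": 2, "d": 3}
    domain = {"a": 0, "b": 0, "c": 1, "d": 1}
    votes = {frozenset(p): (51 if domain[p[0]] == domain[p[1]] else 1)
             for p in combinations(rank, 2)}
    print(sorted(majority_vote_tournament(rank, votes, 0.6, random.Random(0))))
```

In the usage example, pairs inside a domain receive 51 votes and cross-domain pairs a single vote, so within-domain edges are far more reliable than cross-domain ones.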

The threshold $p_{succ} > \frac{1}{2}$ is a standard assumption in all ranking
models that are based on many independent votes. It guarantees that a sufficient
number of votes will enable the algorithm to predict the right comparison with
very high probability. In our model, however, not all the pairs will get a
sufficient number of votes, and this is where the planted partition model of the
preference tournament described in the previous section comes into play. If we
denote by $\mathcal{M}$ the majority-voting model presented above and by
$T_{\mathcal{M}}$ a related preference tournament, then the latter is constructed
from the planted partition model introduced in the previous section. The
parameters $p_u$ and $p_m$ of $T_{\mathcal{M}}$ can be easily derived from the
parameters $p_{succ}$, $M$ and $m$ of the majority-voting model (details in
Section 6.2). It turns out that we can use our clustering algorithm as a
preprocessing step performed on that preference tournament and then combine it
with existing ranking methods to answer most of the queries correctly. As we will
see in the experimental section (see: Appendix), we outperform the
state-of-the-art methods that can be applied in this scenario.
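The link between the number of votes and the reliability of an edge can be made quantitative with a standard binomial computation. The sketch below is not the derivation from Section 6.2; it only computes the probability that the majority of `n_votes` independent votes, each correct with probability `p_succ`, points the wrong way. The function name and the even split of ties are assumptions.

```python
from math import comb


def majority_error_probability(n_votes, p_succ):
    """Probability that the majority of n_votes independent votes is wrong.

    Each vote is correct with probability p_succ. Assumption: a tie is
    resolved by a fair coin, so it contributes half of its probability.
    """
    p_err = 1.0 - p_succ
    total = 0.0
    for wrong in range(n_votes + 1):
        prob = comb(n_votes, wrong) * p_err ** wrong * p_succ ** (n_votes - wrong)
        if 2 * wrong > n_votes:
            total += prob
        elif 2 * wrong == n_votes:
            total += 0.5 * prob
    return total


if __name__ == "__main__":
    # Many votes (within a domain) versus a single vote (across domains).
    print(majority_error_probability(101, 0.6))
    print(majority_error_probability(1, 0.6))
```

For a pair with many votes the error probability is small, in the spirit of the bound $p_u$, while for a pair with very few votes it stays close to the single-vote error rate, which is the regime that $p_m$ is meant to capture.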


3 The Algorithm 

In this section we present both: the clustering algorithm for the digraph planted 
partition model of tournaments and the ranking algorithm for heavily perturbed 
statistics. 


Algorithm 1 - HeteroRanking 

Input: Preference tournament $T$ with parameters:
$p_u, p_m, k_u$ and a precision parameter $\epsilon$
Output: Partitioning $\mathcal{P} = \{P_1, \ldots, P_t\}$ of all but at
most an $\epsilon$-fraction of $V(T)$ and orderings
$\sigma_1, \ldots, \sigma_t$ of $P_1, \ldots, P_t$

begin
    run DagClustering$(T, p_u, p_m, k_u, \epsilon)$ to obtain a partitioning $\mathcal{P}$;
    run Purify$(\mathcal{P}, p_u, p_m, \epsilon)$ to obtain $\mathcal{R}$;
    for every $P_i \in \mathcal{P}$ run QuickSort$(T|P_i, \mathcal{R} \cap P_i)$
    to obtain an ordering within each cluster;
end


The ranking algorithm (HeteroRanking) uses the clustering subroutine
(DagClustering) and the so-called Purify subroutine (responsible for getting rid
of outliers from the clusters of the learned clustering) and orders the vertices
within each part of the obtained partitioning $\mathcal{P}$. The ordering is
performed by the QuickSort subroutine from [1] that uses as pivot points only
points from the set $\mathcal{R}$ of "non-outliers" constructed by the Purify
procedure (for a subset $X \subseteq V(T)$ we denote by QuickSort$(T, X)$ the
algorithm from [1] applied to the tournament $T$, but with pivot points taken
from $X$ instead of $V(T)$). When the clustering and ordering of vertices within
each part of $\mathcal{P}$ is done, the mechanism of answering queries is as
follows: if the query $(x, y)$ satisfies $x, y \in S_i$, where
$S_i \in \mathcal{P}$, then output a point according to the computed ordering of
$S_i$. Otherwise answer randomly.

The clustering algorithm (DagClustering) uses the so-called gadget structure $H$.
A gadget is a small pseudo-random tournament. The only property that we want the
gadget to satisfy is to have at least one backward edge under every ordering of
every subset $S \subseteq V(H)$ whose size $|S|$ exceeds a prescribed threshold.
A random tournament is a gadget with high probability (details in Section 6.5).
Thus a gadget can be trivially constructed in advance, before the main clustering
algorithm starts. There are also standard deterministic constructions of gadgets
(the so-called quadratic residue tournaments, see [IS]).
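The gadget property can be checked by brute force on small tournaments. Requiring a backward edge under every ordering of a subset is the same as requiring that the subset does not induce a transitive subtournament, and it suffices to test subsets of exactly the threshold size. The sketch below assumes a hypothetical threshold parameter `s` (the value used in the paper is not reproduced here) and uses hypothetical function names; it also builds a quadratic residue tournament for a prime congruent to 3 modulo 4.

```python
import random
from itertools import combinations


def random_tournament(n, rng):
    """Orient every pair of {0, ..., n-1} independently with a fair coin."""
    return {(u, v) if rng.random() < 0.5 else (v, u)
            for u, v in combinations(range(n), 2)}


def quadratic_residue_tournament(p):
    """Edge (i, j) iff j - i is a nonzero square mod p (needs p = 3 mod 4)."""
    squares = {(x * x) % p for x in range(1, p)}
    return {(i, j) for i in range(p) for j in range(p)
            if i != j and (j - i) % p in squares}


def is_transitive(subset, beats):
    """A tournament on k vertices is transitive iff its scores are 0..k-1."""
    scores = sorted(sum((u, v) in beats for v in subset if v != u)
                    for u in subset)
    return scores == list(range(len(subset)))


def is_gadget(n, beats, s):
    """True if no s-subset of {0, ..., n-1} induces a transitive tournament."""
    return not any(is_transitive(sub, beats)
                   for sub in combinations(range(n), s))


if __name__ == "__main__":
    print(is_gadget(7, quadratic_residue_tournament(7), 4))
    print(is_gadget(12, random_tournament(12, random.Random(0)), 6))
```

The check is exponential in `s`, which is acceptable only because the gadget is small and is built once, before clustering starts.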

The algorithm also uses the Find procedure, which is essentially a wrapper for
the Searcher subprocedure. It takes as input a digraph $T_1$ (a subgraph of $T$)
and tries to find a special embedding of $H$ in $T_1$. It either finds this
embedding (in which case the procedure returns the copy $H_c$ of $H$) or returns
two sets $X, Y$. The directed density $d(X, Y)$ from $X$ to $Y$ is very close to
0 or 1. This, as we will see later, implies that with very high probability most
of the vertices of $X \cup Y$ came from the same groundtruth domain. In other
words, we obtained a set of vertices of very good purity. Thus the algorithm uses
the local property of not having a particular pattern as a subtournament to
reconstruct a significant part of the groundtruth cluster. This set is then added
to the appropriate cluster of the partial clustering that was already calculated
(or potentially forms a new cluster). The embedding we are looking for in
Searcher is a very simple one, where each vertex is looked for in a different
part of a random partitioning of the vertices into $h$ equal-length chunks.
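The directed density used above can be computed directly. The sketch assumes the natural definition, namely the fraction of pairs $(x, y)$ with $x \in X$, $y \in Y$ for which the edge points from $x$ to $y$, with $X$ and $Y$ disjoint; the tolerance `c` in the helper `is_nearly_one_directional` is a hypothetical parameter, not the constant from the paper.

```python
def directed_density(X, Y, beats):
    """Fraction of pairs (x, y), x in X and y in Y, with an edge x -> y.

    Assumes X and Y are disjoint, non-empty vertex collections and that
    beats is the set of directed edges of a tournament.
    """
    if not X or not Y:
        raise ValueError("X and Y must be non-empty")
    forward = sum((x, y) in beats for x in X for y in Y)
    return forward / (len(X) * len(Y))


def is_nearly_one_directional(density, c):
    """True if the density is within c of 0 or of 1."""
    return density <= c or density >= 1.0 - c


if __name__ == "__main__":
    transitive = {(i, j) for i in range(4) for j in range(i + 1, 4)}
    d = directed_density([0, 1], [2, 3], transitive)
    print(d, is_nearly_one_directional(d, 0.1))
```

Between two different domains the density is expected to stay away from both 0 and 1, which is why an extreme density is evidence that most of $X \cup Y$ comes from a single domain.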


Algorithm 2 - DagClustering 

Input: Preference tournament $T$ with parameters:
$p_u, p_m, k_u$ and a precision parameter $\epsilon$
Output: Partitioning $\mathcal{P} = \{P_1, \ldots, P_t\}$ of all but at
most an $\epsilon$-fraction of $V(T)$

begin

let $H = H(k_u)$ be a gadget;
let $T_1 = T$ and $\mathcal{P} = \emptyset$;
while $|T_1| > \epsilon |T|$ do

run Find$(H, T_1, \epsilon, p_m)$;
if Find returns a copy $H_c$ of $H$ then
    delete all the edges of $H_c$ from $T_1$;
else

let $Z = X \cup Y$, where $X, Y$ are the sets output by Find;
let $S_i = Z \cup P_i$ for $i = 1, \ldots$, where $\mathcal{P} = \{P_1, \ldots\}$;
let back = -g 2e)\Z\\Vi\pu-, 

let $i$ be the smallest index for which QuickSort$(T|S_i)$ outputs an
ordering with no more than $back$ backward edges in $T$ with one
endpoint in $Z$ and the other in $P_i$;
if $i$ exists then

    replace in $\mathcal{P}$ cluster $P_i$ by $S_i$;
else

    update: $\mathcal{P} \leftarrow \mathcal{P} \cup \{Z\}$;

end

update: $V(T_1) \leftarrow V(T_1) \setminus Z$;

end

end

output $\mathcal{P}$;

end


The Purify subroutine gets as input a partitioning, where each part is a
good approximation of a groundtruth cluster, and eliminates outliers from each
cluster. This can be done effectively by observing that outliers contribute to a
much greater extent than other nodes to the total number of directed triangles
of a particular type in the cluster. Notice that the Purify procedure is not used
by the digraph clustering algorithm. Since it does not shed any light on our main
contribution in this paper, the clustering algorithm, we will comment more on
that procedure in the Appendix.
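The triangle-counting idea behind Purify can be sketched as follows. The exact triangle type and threshold used by the procedure are not spelled out in this section, so the sketch makes explicit assumptions: it counts cyclic triangles $v \to u \to w \to v$ through a vertex $v$ inside a cluster, and it counts them exactly instead of sampling. The function name is hypothetical.

```python
def cyclic_triangles_through(v, cluster, beats):
    """Number of pairs (u, w) in the cluster with v -> u, u -> w and w -> v.

    Assumption: this is the 'particular type' of directed triangle; in a
    transitive cluster the count is zero for every vertex.
    """
    out_nb = [u for u in cluster if (v, u) in beats]
    in_nb = [w for w in cluster if (w, v) in beats]
    return sum((u, w) in beats for u in out_nb for w in in_nb)


if __name__ == "__main__":
    # Toy cluster: vertices 0..3 are ordered transitively, vertex 4 does not
    # fit the ordering (it points to 0 and 1 but is pointed to by 2 and 3).
    beats = {(i, j) for i in range(4) for j in range(i + 1, 4)}
    beats |= {(4, 0), (4, 1), (2, 4), (3, 4)}
    cluster = [0, 1, 2, 3, 4]
    print([cyclic_triangles_through(v, cluster, beats) for v in cluster])
```

In the toy cluster the misplaced vertex takes part in more cyclic triangles than any other vertex, which is the kind of signal a threshold test can pick up.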
4 Main theoretical results

We now state the main theoretical results regarding the algorithms presented in
the previous section. All the proofs are given in the Appendix.

Algorithm 3 - Find 

Input: Tournament $H$ with $V(H) = \{v_1, \ldots, v_{|H|}\}$,
digraph $T_1$, parameters: $\epsilon, p_m$

Output: A copy $H_c$ of $H$ in $T_1$ or two sets
$X, Y \subseteq V(T_1)$

begin 

Initialization: 

let /i = |iL|, ni = iTil, c = jcpm and S' = 0; 
partition randomly $V(T_1)$ into $h$ sets
$W_1, \ldots, W_h$, each of size $\lfloor n_1 / h \rfloor$;
return Searcher$(H, \{W_1, \ldots, W_h\}, S)$;

end 
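The random partition step of Find can be sketched in a few lines. This is an illustration under assumptions: the chunk size is taken to be the integer part of $n_1 / h$, leftover vertices (when $h$ does not divide $n_1$) are simply left out of this round, and the function name is hypothetical.

```python
import random


def random_equal_partition(vertices, h, rng):
    """Randomly split vertices into h disjoint chunks of equal size.

    Each chunk has len(vertices) // h elements; any remainder is dropped
    (an assumption, since the handling of leftovers is not specified).
    """
    shuffled = list(vertices)
    rng.shuffle(shuffled)
    size = len(shuffled) // h
    return [shuffled[i * size:(i + 1) * size] for i in range(h)]


if __name__ == "__main__":
    print(random_equal_partition(range(10), 3, random.Random(0)))
```

Searcher then looks for the j-th gadget vertex only inside the j-th chunk, which keeps the embedding search simple.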

Our first result is about the general planted partition model for the preference
tournament and shows that the DagClustering algorithm reconstructs the
groundtruth domains with very good precision.

Theorem 1. Let $T$ be a preference tournament with parameters $p_u, p_m, k_u$
and $k$ groundtruth domains ($k$ does not have to be publicly available). Assume
that $het(T) > 12$, each groundtruth domain is of size at least two and has in
expectation at least $\log(|T|)$ backward edges under its canonical ordering. Let
$\epsilon$ be a precision parameter satisfying: 2/iy^< e < \Pm). Then for $|T|$
large enough, with probability $p_{succ} = 1 - o(1)$ algorithm DagClustering with
input parameters $p_u, p_m, k_u$ and $\epsilon$ outputs a $(1 - \epsilon)$-pure
partitioning $\mathcal{P} = \{P_1, \ldots, P_{k'}\}$ of all but at most an
$\epsilon$-fraction of all the vertices of $V(T)$ for some $0 < k' < k$.

The next theorem gives an upper bound on the generalization error of the
HeteroRanking algorithm for the introduced majority-voting model. The following
is true:

Theorem 2. Assume the majority-voting model $\mathcal{M}$. Let e < Pmi^Pm— het(T) )-
Let $\mathcal{T}$ be an $(M, m)$-unbalanced training set. Assume that the number
of objects to rank is large enough and that the related preference tournament
$T_{\mathcal{M}}$ satisfies the conditions given in the statement of Theorem 1.
Let $N$ be the total number of queries asked. Then for the average preference
tournament, with probability $p = 1 - o(1)$ the ranking mechanism defined by the
output of the algorithm HeteroRanking answers correctly at least:
N (1 — 2e)^(l — 4p„) queries. The average is taken over the random coin tosses
from the training phase. The probability $p$ is taken over the random coin tosses
from the test phase.

In the statement above we can in fact get rid of averaging since the random
variables under consideration are tightly concentrated around their means.
Because this follows immediately from classic concentration inequalities, we
omit this check here and present it in the Appendix. Since we have $M \gg m$ and
$p_u \ll 1$, the presented ranking scheme answers most of the queries correctly.
In comparison, existing state-of-the-art methods succeed much less frequently in
this setting. In particular, most of them are very far from achieving a recall
close to 1. In Section 6.9 we will prove it and explain in more detail why the
standard approach is very ineffective under high heterogeneity assumptions.

Algorithm 4 - Searcher$(G, \{W_j, \ldots, W_h\}, S)$

Input: Tournament $G$ with $V(G) = \{v_j, \ldots, v_h\}$,
set of subsets of vertices: $W_j, \ldots, W_h$, set of vertices $S$;
Output: a pair of disjoint sets $(X, Y)$ or a copy of the gadget $H$;

begin

if $|G| = 0$ then
    output a tournament induced by $S$;
end

find in $W_j$ a vertex $w$ with the following property for every $i = j+1, \ldots, h$:
    $w$ is adjacent to at least $c|W_i|$ vertices of $W_i$ if $(v_j, v_i) \in E(G)$, and
    $w$ is adjacent from at least $c|W_i|$ vertices of $W_i$ if $(v_i, v_j) \in E(G)$;
if $w$ is found then
    update $S \leftarrow S \cup \{w\}$;
    let $N_w(i)$ (for $i = j+1, \ldots, h$) be:
        a set of outneighbors of $w$ in $W_i$ if $(v_j, v_i) \in E(G)$ and
        a set of inneighbors of $w$ in $W_i$ if $(v_i, v_j) \in E(G)$;
    let $G' = G|\{v_{j+1}, \ldots, v_h\}$ and update: $W_i \leftarrow N_w(i)$ for $i = j+1, \ldots, h$;
    output Searcher$(G', \{W_{j+1}, \ldots, W_h\}, S)$;
else
    (by the Pigeonhole Principle) there exists a set $X \subseteq W_j$ of order
    $|X|$ > h^j^+i and an index $i^* \in \{j+1, \ldots, h\}$ with the following property:
        either every $x \in X$ has at most $c|W_{i^*}|$ outneighbors in $W_{i^*}$ or
        every $x \in X$ has at most $c|W_{i^*}|$ inneighbors in $W_{i^*}$;
    let $Y = W_{i^*}$. Output: $(X, Y)$;
end
end


5 Conclusions and future work 

We presented a new algorithm performing clustering in the digraph setting.
Contrary to almost all other results on clustering digraphs, the goal is not to
partition the tournament into pseudo-strongly-connected components, but into
subtournaments that can be made transitive by reversing only a few edges. This
enables us to use the algorithm as a preprocessing phase of learning how to rank
from heavily perturbed preference tournaments. To the best of our knowledge, this
is the first approach of this kind that addresses at the same time and tightly
connects two important problems of modern computer science: data clustering in
the directed setting and ranking. As a corollary, we obtain a new purely
combinatorial ranking algorithm and use it to effectively rank with preference
tournaments constructed according to the majority-voting model. Experimental
results show the advantage of our approach over top state-of-the-art methods.
The algorithm can be viewed as a general tool for finding local nonrandom
substructures in a heterogeneous network that globally looks like a random graph.

Algorithm 5 - Purify 

Input: A partitioning $\mathcal{P}$, parameters: $p_u, p_m$ and a precision parameter $\epsilon$

Output: Set of nonoutliers $\mathcal{R}$

begin

$\mathcal{R} \leftarrow \emptyset$;
let $T'$ be a preference tournament with parameters $p_u, p_m$
(thus of the same characteristic as $T$ but obtained independently from $T$);
for $P_i \in \mathcal{P}$ do
    let threshold = IT’ipp^;
    for $v \in P_i$ do
        let $N_v^+$ be the set of outneighbors of $v$ in $T'|P_i$;
        let $N_v^-$ be the set of inneighbors of $v$ in $T'|P_i$;
        choose $s = \Theta(\log(n))$ samples uniformly at random from the set
        $N_v^+ \times N_v^-$;
        let $r$ be the number of samples $(u,w)$ such that (u,v) €
        if > threshold then
            classify $v$ as an outlier;
        else
            $\mathcal{R} \leftarrow \mathcal{R} \cup \{v\}$;
        end
    end
end

output $\mathcal{R}$;

end

It aims to work well for very large sets of objects for which the entire
preference graph is not necessarily immediately known. It achieves this goal by
acting locally on the preference graph, reconstructing the clustering (that will
be used later on to rank) part by part. For this reason the authors plan to
present a parallel version of the algorithm in the next paper. It would also be
interesting to use techniques similar to those presented here to propose a new
clustering algorithm in the undirected setting.
6 Appendix

6.1 Introduction 

In the Appendix we will comment more on the technical Purify procedure from
the main body of the paper. We also prove the correctness of the HeteroRanking
and DagClustering algorithms by proving Theorem 4.1 and Theorem 4.2.
Both theorems will be proved by showing slightly more general and technical
results. Before proving Theorem 4.2, we will remind the reader of the
majority-voting scheme. Furthermore, we will show why the model we are analyzing
here can also be used to describe the weighted preference tournament setting. We
will also give a theoretical comparison between the quality of the ranking
obtained by the HeteroRanking algorithm and state-of-the-art methods. At the very
end we show the results of the experiments comparing our method with
state-of-the-art techniques.

Recall that for a directed edge $(v,w)$ in a digraph $T$ we say that
$v$ is adjacent to $w$ and $w$ is adjacent from $v$. Alternatively, we may say
that $w$ is an outneighbor of $v$ and $v$ is an inneighbor of $w$. For a set $A$
we denote by $s(A)$ the set of all unordered pairs of different elements from
$A$, namely: $s(A) = \{\{x, y\} : x, y \in A, x \neq y\}$.
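The notation above translates directly into code. The following is a small sketch with hypothetical helper names, assuming a digraph is stored as a set of directed edges `(v, w)`.

```python
from itertools import combinations


def unordered_pairs(A):
    """s(A): the set of all unordered pairs of different elements of A."""
    return {frozenset(pair) for pair in combinations(A, 2)}


def outneighbors(v, edges):
    """All w such that (v, w) is an edge, i.e. v is adjacent to w."""
    return {w for (u, w) in edges if u == v}


def inneighbors(w, edges):
    """All v such that (v, w) is an edge, i.e. w is adjacent from v."""
    return {v for (v, x) in edges if x == w}


if __name__ == "__main__":
    edges = {("v", "w"), ("w", "x")}
    print(len(unordered_pairs(["v", "w", "x"])))
    print(outneighbors("v", edges), inneighbors("w", edges))
```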

6.2 Ranking via majority-voting 

We now recall the majority-voting model in the context of the preference
tournament. This is just one example of how heterogeneous preference
tournaments, encoding ranking statistics, may be straightforwardly created from
nonuniform data. It is probably the easiest one to describe. For other models
(that we did not focus on in this paper) we can also benefit from applying the
presented algorithm, for the same reasons that will soon become obvious.

Assume that the users compare certain products that come from different
domains. Products from the same domain are compared more often than products
from different domains (there are many reasons why this might be the case; as
mentioned earlier, it may not even make sense to compare products from
different domains). Let us assume though that there exists some groundtruth
ranking of all the objects. From what we have said so far it is clear that this
ranking will play an important role only for pairs of objects within the same
domain.

Definition 1. Let $\mathcal{U}$ be the universe of all the objects. We denote by
$\mathcal{P}_{\mathcal{U}}$ the probability distribution on $s(\mathcal{U})$ from
which pairs of evaluated objects (in the training phase) or pairs of objects to
evaluate (in the test phase) are being selected.

The set $s(\mathcal{U})$ forms an input for users' evaluations and
$\mathcal{P}_{\mathcal{U}}(\{u, v\})$ is the probability that the next collected
statistic will regard objects $u$ and $v$. In this paper we are interested in
$\mathcal{P}_{\mathcal{U}}$ that is very far from being uniform. Assume that
$\mathcal{U} = \mathcal{U}_1 \cup \dots \cup \mathcal{U}_k$ and most of the mass
of the distribution $\mathcal{P}_{\mathcal{U}}$ is concentrated on the set
$s(\mathcal{U}_1) \cup \dots \cup s(\mathcal{U}_k)$. Assume furthermore that when
all the statistics for an unordered pair of objects $\{u, v\}$ are collected then
the direction of an edge in the preference tournament is determined by voting:
the direction of an edge is consistent with the one given by the majority of
voters. Each statistic (each vote) is given independently at random and the
probability that a user makes a mistake (i.e. gives a preference not consistent
with the groundtruth ranking) is $p_{mis} < \frac{1}{2}$, which may be
substantial (potentially even very close to $\frac{1}{2}$) also for objects from
the same domain. A single user is allowed to make mistakes but since many users
will be taken into account while determining the direction of each edge in the
preference tournament, the law of large numbers saves us.

There is a caveat here that will lead us to the preference tournament model
analyzed in the main body of the paper. If the number of statistics/votes is not
large enough then taking the majority vote may not be sufficient to reconstruct
the groundtruth ordering. Let us quantify this last statement. If we obtain $K$
statistics for a certain unordered pair of objects $\{u, v\}$ then standard
concentration inequalities, such as Chernoff's inequality, give us an upper bound
on $P_{mis}(u, v)$ that decreases exponentially in $K$ and is expressed in terms
of $\delta = \frac{1}{2} - p_{mis}$ and $p_{succ} = 1 - p_{mis}$. Here
$P_{mis}(u, v)$ stands for the probability of the event that the edge between $u$
and $v$ in the preference tournament is not consistent with the groundtruth
ranking. So if $K$ is large enough then the upper bound on the probability of the
mistake will be small. However if $K$ is not too large, it may turn out that not
only $p_{mis}$ but even $P_{mis}(u, v)$ will be significant (this could be the
case in particular when $p_{mis}$ is very close to $\frac{1}{2}$). In this case
both the probability that the direction of an edge in the preference tournament
is right and the probability that it is wrong are lower-bounded by some
substantial $p_m$. This is how the parameters $p_u$ and $p_m$ of the preference
tournament come into action. They reflect the nonuniform distribution
$\mathcal{P}_{\mathcal{U}}$. We will now give a full definition of the
$(M, m)$-unbalanced training set that was used in the main body of the paper.
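
To get a feeling for how the number of votes $K$ controls the reliability of a single edge, the following minimal sketch computes the exact probability that a majority vote points in the wrong direction. The function names are illustrative and do not come from the paper. Two assumptions are made that the text does not state: ties (possible only for even $K$) are broken by a fair coin, and the comparison bound is the generic Hoeffding bound $e^{-2K\delta^2}$, which is not necessarily the exact bound used above.

```python
from math import comb, exp


def majority_error_probability(num_votes: int, p_mis: float) -> float:
    """Exact probability that the majority of `num_votes` independent votes,
    each wrong with probability `p_mis`, gives the wrong edge direction.

    Assumption (not from the paper): a tie is broken by a fair coin.
    """
    total = 0.0
    for wrong in range(num_votes + 1):
        prob = (comb(num_votes, wrong)
                * p_mis ** wrong
                * (1.0 - p_mis) ** (num_votes - wrong))
        if 2 * wrong > num_votes:
            total += prob          # strict majority of wrong votes
        elif 2 * wrong == num_votes:
            total += 0.5 * prob    # tie, resolved by a fair coin
    return total


def hoeffding_bound(num_votes: int, p_mis: float) -> float:
    """Generic Hoeffding upper bound exp(-2 K delta^2), delta = 1/2 - p_mis."""
    delta = 0.5 - p_mis
    return exp(-2.0 * num_votes * delta ** 2)
```

With $p_{mis} = 0.45$ the error of a single edge stays substantial for small $K$ (for example `majority_error_probability(5, 0.45)` is above 0.4), which is exactly the regime in which the parameter $p_m$ becomes relevant.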

Definition 2. Let $\mathcal{U}$ be the universe of all the objects and let
$\{\mathcal{U}_1, \dots, \mathcal{U}_k\}$ be the partitioning of $\mathcal{U}$.
Let $\mathcal{P}_{\mathcal{U}}$ be a probability distribution on $s(\mathcal{U})$
that describes the distribution of the elements (unordered pairs of points) used
as an input for training. Let $T \subseteq \mathcal{U} \times \mathcal{U}$ be a
training set (directed pairs encode users' preferences) for which the
corresponding unordered pairs were chosen from the distribution
$\mathcal{P}_{\mathcal{U}}$ and the preferences were chosen according to the
majority-voting scheme with a parameter $p_{mis}$. We say that $T$ is
$(M, m)$-unbalanced with respect to the partitioning
$\{\mathcal{U}_1, \dots, \mathcal{U}_k\}$ (or simply: $(M, m)$-unbalanced if the
partitioning is clear from the context) for $M > m$ if the following holds:

- $\mathcal{P}_{\mathcal{U}}(\{u, v\})|T| \geq M$ for $u, v \in \mathcal{U}_i$ ($i = 1, \dots, k$), and

- $\mathcal{P}_{\mathcal{U}}(\{u, v\})|T| \leq m$ for $u \in \mathcal{U}_i$, $v \in \mathcal{U}_j$, $i \neq j$.

In other words, we want to get on average the feedback from at least $M$ users
for every pair of points from the same domain and at most $m$ for every pair of
points from different domains.
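
The two conditions of Definition 2 can be checked directly when the pair distribution is known. The sketch below is only an illustration: `is_unbalanced`, `pair_probability` and the other names are hypothetical, and the pair distribution is assumed to be available as a Python callable.

```python
from itertools import combinations
from typing import Callable, Hashable, Sequence


def is_unbalanced(
    domains: Sequence[Sequence[Hashable]],
    pair_probability: Callable[[Hashable, Hashable], float],
    training_set_size: int,
    big_m: float,
    small_m: float,
) -> bool:
    """Check whether a training set of the given size is (M, m)-unbalanced.

    `pair_probability(u, v)` plays the role of the probability of the
    unordered pair {u, v}; `domains` is the partitioning of the universe.
    """
    # Pairs inside one domain: expected number of votes must be at least M.
    for domain in domains:
        for u, v in combinations(domain, 2):
            if pair_probability(u, v) * training_set_size < big_m:
                return False
    # Pairs across two domains: expected number of votes must be at most m.
    for first, second in combinations(domains, 2):
        for u in first:
            for v in second:
                if pair_probability(u, v) * training_set_size > small_m:
                    return False
    return True
```

For instance, with two domains of three objects each, probability 0.15 for every within-domain pair and 0.1/9 for every cross-domain pair, a training set of size 1000 is (100, 20)-unbalanced but not (200, 20)-unbalanced.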

Denote the majority-voting model with the $(M, m)$-unbalanced training set
described above as $\mathcal{M}$. We denote by $\mathcal{T}_{\mathcal{M}}$ the
preference tournament model related to $\mathcal{M}$. The parameters $p_u, p_m$
of $\mathcal{T}_{\mathcal{M}}$ may be easily derived from the parameters
$M, m, p_{mis}$ of $\mathcal{M}$ so we will not give explicit formulas here. For
us it suffices to know that the bigger $M$ and the smaller $m$, the bigger $p_m$
and the smaller $p_u$ we may take.

6.3 Purify procedure 

The Purify procedure (we give it again in the Appendix for the convenience of
the reader) gets as an input a partitioning in which every part is
$(1 - \epsilon)$-pure with high probability. Its goal is to get rid of the set of
outliers from each part of the partitioning, since the pivot points that will be
used in the Quicksort algorithm cannot be outliers. This task can be accomplished
in several different ways and is much easier than the initial clustering problem
on the directed graph, since Purify operates on a very good approximation of the
groundtruth clustering.

Algorithm 5 - Purify

Input: A partitioning $\mathcal{P}$, parameters $p_u, p_m$ and a precision parameter $\epsilon$

Output: Set of nonoutliers $\mathcal{R}$

begin
    $\mathcal{R} \leftarrow \emptyset$;
    let $T'$ be a preference tournament with parameters $p_u, p_m$
    (thus of the same characteristic as $T$ but obtained independently from $T$);
    for $\mathcal{P}_i \in \mathcal{P}$ do

let threshold = ^^ 32 ^ 

        for $v \in \mathcal{P}_i$ do
            let $N^+$ be the set of outneighbors of $v$ in $T'|\mathcal{P}_i$;
            let $N^-$ be the set of inneighbors of $v$ in $T'|\mathcal{P}_i$;
            choose $s = O(\log(n))$ samples uniformly at random from the set $N^+ \times N^-$;
            let $r$ be the number of samples $(u, w)$ such that $(u, w) \in E(T')$;
            if r/s > threshold then
                classify v as an outlier;
            else
                $\mathcal{R} \leftarrow \mathcal{R} \cup \{v\}$;
            end
        end
    end
    output $\mathcal{R}$;
end

One possible approach focuses on the number of directed triangles touching
a given point of one of the computed clusters. If that point is an outlier we
expect a quadratic number of directed triangles in the cluster touching that
point, with the multiplicative constant next to the quadratic factor much bigger
than $\epsilon$. On the other hand, if it is not an outlier then the expected
number of directed triangles touching that point will be at most quadratic with
the constant next to the quadratic factor of the order of $\epsilon$. This
observation immediately leads to an algorithm detecting outliers: for every point
of the cluster we compute the number of directed triangles touching this point
and if this number is greater than a certain threshold then we classify the point
as an outlier. There are two things that should be noticed here. Counting the
exact number of the directed triangles touching any given point $v$ can take
$O(n^2)$ time. Fortunately we do not need the exact number; what is really needed
is a good enough approximation. This approximation might be obtained by sampling
randomly from the set of pairs $(u, w)$, where $u \in N^+$, $w \in N^-$, $N^+$ is
the set of outneighbors of $v$ and $N^-$ is the set of inneighbors of $v$. While
sampling we count the fraction of pairs $(u, w)$ such that $(u, w) \in E(T)$. If
this fraction is larger than a certain threshold then we classify the point as an
outlier. To resolve the issue with a dependence between the output of the
HeteroRanking algorithm and the direction of edges under investigation in the
Purify subprocedure, we run Purify on a new preference tournament $T'$ obtained
independently from $T$ but for a partitioning output by the HeteroRanking
algorithm. The tournament $T'$ is obtained from the same distribution as $T$. In
the subsection where we give the proof of Theorem 4.2 we also prove the
correctness of the Purify algorithm presented above. We will prove in particular
that the number of samples $s$ needed is a small multiple of $\log(n)$.
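
The sampling step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the cluster is assumed to be given as a dictionary mapping each vertex to the set of its outneighbors inside the cluster, the names `estimate_triangle_fraction` and `purify_cluster` are hypothetical, and the threshold is passed in as a parameter because its exact value is not reproduced here.

```python
import random
from typing import Dict, Hashable, Set

# vertex -> set of its outneighbors, restricted to one cluster
Digraph = Dict[Hashable, Set[Hashable]]


def estimate_triangle_fraction(v: Hashable, out_neighbors: Digraph,
                               num_samples: int, rng: random.Random) -> float:
    """Estimate the fraction of pairs (u, w), with u an outneighbor and w an
    inneighbor of v, that close a directed triangle v -> u -> w -> v."""
    n_plus = list(out_neighbors[v])
    n_minus = [w for w in out_neighbors if v in out_neighbors[w]]
    if not n_plus or not n_minus:
        return 0.0  # nothing to sample, so no directed triangle touches v
    hits = 0
    for _ in range(num_samples):
        # Independent uniform choices give a uniform sample from N+ x N-.
        u = rng.choice(n_plus)
        w = rng.choice(n_minus)
        if w in out_neighbors[u]:  # the edge u -> w closes the triangle
            hits += 1
    return hits / num_samples


def purify_cluster(out_neighbors: Digraph, threshold: float,
                   num_samples: int, rng: random.Random) -> set:
    """Keep the vertices whose estimated triangle fraction does not exceed
    the threshold; the remaining vertices are classified as outliers."""
    return {
        v for v in out_neighbors
        if estimate_triangle_fraction(v, out_neighbors, num_samples, rng) <= threshold
    }
```

In a transitive cluster no directed triangle exists, so every vertex is kept for any nonnegative threshold; the number of samples would be chosen proportional to $\log(n)$ as stated above.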

6.4 Tools 

We will need two standard concentration inequalities. The first one is Chernoff's
inequality:

Theorem 3. Let $\delta > 0$. Let $X = \sum_{i=1}^{n} X_i$, where the $X_i$s are
independent and each $X_i$ equals 1 with probability $p_i$ and is zero otherwise.
Denote $\mu = EX = \sum_{i=1}^{n} p_i$. Then the following holds:

- $P(X \geq (1 + \delta)\mu) \leq e^{-\frac{\delta^2 \mu}{2 + \delta}}$,

- $P(X \leq (1 - \delta)\mu) \leq e^{-\frac{\delta^2 \mu}{2}}$.

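We will also need Azuma's inequality:

Theorem 4. Let $\{Z_n, n \geq 1\}$ be a martingale. Let $Z_0 = 0$. Assume that
$-\alpha \leq Z_n - Z_{n-1} \leq \beta$ for every $n \geq 1$. Then the following
is true:

- $P(Z_n \geq nc) \leq e^{-\frac{2nc^2}{(\alpha + \beta)^2}}$,

- $P(Z_n \leq -nc) \leq e^{-\frac{2nc^2}{(\alpha + \beta)^2}}$.

More generally, if we have $-\alpha_i \leq Z_i - Z_{i-1} \leq \beta_i$ then:

- $P(Z_n \leq -a) \leq e^{-\frac{2a^2}{\sum_{i=1}^{n}(\alpha_i + \beta_i)^2}}$,

- $P(Z_n \geq a) \leq e^{-\frac{2a^2}{\sum_{i=1}^{n}(\alpha_i + \beta_i)^2}}$.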
6.5 Gadget tournament H

The gadget tournament is a useful, well-known mathematical tool from the main
body of the paper which can be easily constructed. However, for completeness we
comment more on the gadget here. In this section we briefly describe how the
gadget tournament used by the HeteroRanking algorithm should be constructed.

We want every subset of $V(H)$ of order at least $\frac{h}{k_u}$, where
$h = |V(H)|$, to have at least one backward edge under every ordering of
vertices. We call this property the gadget-property. Tournament $H$ can be
constructed randomly as the next lemma states:

Lemma 1. Let H he a tournament satisfying: 4 iog(ft,)-|-i ^ ~ log(l — p)) 

in which the direction of every edge is chosen independently at random with
probability $\frac{1}{2}$. Then with probability at least $p$ tournament $H$
satisfies the gadget property.

Proof. Denote p, = y(^) = ^(1 ~ Denote by p] the probability that 

some of the ^-element subsets of V{H) induces at most (1 — S)p backward 
edges under some ordering of vertices. Let us fix an ordering of vertices 9 and 
lets enumerate all the edges of the tournament. Let Xi be an indicator ran¬ 
dom variable that is equal to 1 if edge is backward and is zero otherwise. 
If we dehne: Zm = ~ then we see that {Zm : m = 1,2,...} is a 

martingale. Besides, we have: —0.5 < Z^ < 0.5. Thus, from the Azuma’s in- 
equality, we get: P(Z„ < —nc) < , where: n = (^). Therefore we obtain: 

F{B < /i(l — 6)) < e~^^ , where B = Xi-\- ...-\-Xn is a random variable counting 
the number of backward edges. Now, if we sum over all (^)! possible orderings 
of vertices of the fixed subset of order ^ and over all possible subsets of or¬ 
der we obtain: pj < Evaluating this expression, we obtain: 

Ps ^ e''“ . Now, if we take 5 = 1 and take h satisfying the 

assumptions of the lemma, we obtain: p\ < p. That completes the proof. 

Note that in particular we have proved that for every $h$ satisfying
$\frac{h}{4\log(h) + 4} > k_u$ there exists an $h$-vertex tournament satisfying
the gadget-property. In practice we do not even need to make a random
construction to obtain $H$. There are plenty of deterministic constructions of
tournaments with pseudo-random properties, in particular with the
gadget-property. For example one can use the family of quadratic residue
tournaments. The proof given above is useful though to get a simple upper bound
on the order of tournaments that may serve as gadgets.
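
As an illustration of the deterministic alternative, the sketch below builds a quadratic residue tournament and checks the gadget-property by brute force. It relies on two standard facts: for a prime $p$ with $p \equiv 3 \pmod 4$ exactly one of $j - i$ and $i - j$ is a quadratic residue modulo $p$, and a set of vertices admits an ordering without backward edges exactly when it induces a transitive subtournament. The function names are hypothetical, the caller is assumed to pass a prime, and the exhaustive check is meant only for very small tournaments.

```python
from itertools import combinations
from typing import Dict, Iterable, Set


def quadratic_residue_tournament(p: int) -> Dict[int, Set[int]]:
    """Tournament on {0, ..., p-1} with an edge i -> j iff (j - i) mod p is a
    nonzero quadratic residue. Assumes p is a prime with p % 4 == 3."""
    if p % 4 != 3:
        raise ValueError("p must be a prime congruent to 3 modulo 4")
    residues = {(x * x) % p for x in range(1, p)}
    return {
        i: {j for j in range(p) if j != i and (j - i) % p in residues}
        for i in range(p)
    }


def is_transitive(vertices: Iterable[int], out_neighbors: Dict[int, Set[int]]) -> bool:
    """A subtournament is transitive iff its out-degrees inside the subset
    are exactly 0, 1, ..., s - 1."""
    subset = set(vertices)
    scores = sorted(len(out_neighbors[v] & subset) for v in subset)
    return scores == list(range(len(subset)))


def has_gadget_property(out_neighbors: Dict[int, Set[int]], subset_size: int) -> bool:
    """Brute-force check that every subset of `subset_size` vertices has a
    backward edge under every ordering. Larger subsets then inherit the
    property, since they contain a subset of that size."""
    return not any(
        is_transitive(subset, out_neighbors)
        for subset in combinations(out_neighbors, subset_size)
    )
```

For $p = 7$ the check succeeds for subsets of order 4 and fails for subsets of order 3, so this 7-vertex tournament could serve as a gadget whenever the required subset order is 4.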

6.6 Proof of Theorem 4.1 

We give here a detailed proof of Theorem 4.1. We start with the lemma that plays
an important role in the procedure Find used by DagClustering.

Lemma 2. Let $c > 0$, let $H$ be a tournament on $h$ vertices and let $T$ be an
$H$-free digraph. Then $V(T)$ contains two disjoint subsets $A, B$ satisfying:

- $|A|, |B| \geq c^{h-1}\lfloor \frac{|T|}{h} \rfloor$ and

- either every vertex from $A$ is adjacent to at most $c|B|$ vertices from $B$ or
every vertex from $A$ is adjacent from at most $c|B|$ vertices in $B$.

That lemma was stated and proved in a slightly different setting (undirected
graphs) in [26]. Since the proof in the directed setting is very similar, we
refer the reader to [26] for details (Lemma 1.5, p. 40). The procedure Find
mimics the proof of the lemma above. In particular, whenever Find outputs two
sets $X, Y$ we have:

- $|X| \geq c^{h-1}\lfloor \frac{|T|}{h} \rfloor$, $|Y| \geq c^{h-1}\lfloor \frac{|T|}{h} \rfloor$ and

- either every vertex from $X$ is adjacent to at most $c|Y|$ vertices of $Y$ or
every vertex from $X$ is adjacent from at most $c|Y|$ vertices of $Y$.

An important conclusion from the lemma is that the absence of the tournament $H$
in $T$ implies a property that random tournaments satisfy with very small
probability, namely the existence of two substantial (linear-size) sets $A, B$
with directed density between them close to one or zero. The intuition is now
that with high probability most of the vertices from these two sets come in fact
from the same domain (note that tournaments induced by domains are very
nonrandom). Thus, by getting $A$ and $B$, we can with high probability extract a
very "pure chunk". This chunk can then be added to the part of the related domain
that has already been extracted. We will make all these observations much more
precise a little bit later.

In the DagClustering algorithm we delete from the digraph all edges of the copy
of $H$ if a copy was found. The explanation is as follows: if a copy was found
then one of its edges must be a backward edge within some domain under its
canonical ordering (this easy observation is a consequence of the definition of
the gadget tournament, we will see why later). We call edges like that bad edges.
We do not know exactly which edges of the copy are bad and that is why we delete
all of them. By doing it systematically, we eventually get rid of all bad edges.
Doing so we also get rid of edges that are not necessarily bad. Fortunately, with
high probability the number of bad edges is not very large, thus while clearing
the entire digraph of bad edges we get rid of not too many other edges. Thus the
detection of a copy of $H$ and the deletion of its edges from the digraph is a
convenient way to remove bad edges without doing much harm to the overall
structure of the digraph. The following is a useful property of gadgets:

Lemma 3. Let $T_i$ be a digraph from the DagClustering algorithm. Then if $T_i$
contains a copy of $H$, one of the edges of this copy is a bad edge.

Proof. Assume for contradiction that the found copy of $H$ does not contain a
bad edge. By the pigeonhole principle, at least $\frac{h}{k_u}$ vertices of the
found copy of $H$ were taken from the same domain. Call this set $X$. Take the
canonical ordering of the vertices from $X$. From the gadget property we know
that this ordering induces at least one backward edge. That contradicts our
previous assumption.
Now we need to quantify the statement that a preference tournament $T$ that
is an input of the algorithm with high probability does not have too many bad
edges. Denote by $n_1, \dots, n_k$ the sizes of the domains.

Lemma 4. Let g{n) be some positive function. Let S = max(2, ^ 

Denote M = + Then the probability that a preference 

tournament T contains more than M bad edges is at most . 

Proof. Let $B$ be a random variable that counts the number of bad edges in
$T$. Note that $B$ is a sum of $\sum_{i=1}^{k}\binom{n_i}{2}$ random variables
$X_i$, where each $X_i$ corresponds to a certain pair of vertices within a
particular domain. Every $X_i$ is one with probability $p_j$, where $j$ is the
number of the domain that the vertices corresponding to $X_i$ were taken from,
and zero otherwise. Thus we have: $\mu$ =
EB = (” 2 )^* = Y!i=i ^(1 - Now, for any 5 > 0 the probability p] 

that the number of backward edges is more than (1 + 6)p,, is (by at most 
y^fc nizin_L) 

g 2 +d z^i=i 2 V m! ^ expression on the LHS of the last inequality is at 

k 

most if: > log(5(n)). One can easily notice that 

this inequality is satisfied for our choice of the value of <5 from the statement of 
the lemma. That completes the proof. 

Let us recall the definition of $(1 - \epsilon)$-purity. We say that a set of
vertices $X$ is $(1 - \epsilon)$-pure if all but at most an $\epsilon$-fraction
of the vertices of $X$ are from the same domain. For two disjoint sets $X, Y$ we
denote by $E(X, Y)$ the number of directed edges going from $X$ to $Y$.
Intuitively speaking, we expect to have substantial numbers of directed edges
going both from $X$ to $Y$ and from $Y$ to $X$ if $X$ and $Y$ contain
substantial chunks from different domains. Below we make this statement precise
and give it in the form that will be very useful later in the proof (the
parameters $p_u, p_m$ used in the statement of the next lemma were already
defined in the section describing the preference tournament model):

Lemma 5. Let $\delta_M$, $M$ be as in Lemma 4. Let $T$ be a preference tournament
with $|T| = n$ and each domain of size at least two. Assume that $k_u > 2$. Let
$\epsilon$ satisfy:

^ > e > T." . Let h > 0 and g{n) be a positive function. Assume 

that 0 < c < \epm and n > max(^, \/^log(g(^)))- Denote 

by T'^ a tournament obtained from the preference tournament T by deleting some 
\M edges (notice that we do not assume anything about the mechanism accord¬ 
ing to which those edges were deleted, in particular the set of deleted edges might 
be highly correlated with the overal structure ofT). Let £ be the following event: 

— there exist two sets: A and B in T'^ such that: \A\ > c^~^[j((\, \B\ > 
AU B is not (1 — e)-pure and either every vertex of A is ad¬ 
jacent to at most c|B| vertices in B or every vertex in A is adjacent from at 
most c\B\ vertices in B. 

Then the probability p^ that £ holds is at most . 
Proof. Take two sets $A$ and $B$ satisfying the property in the statement of the
lemma. From the pigeonhole principle we know that $A$ contains a subset $A'$
of size $|A'| \geq \frac{1}{k_u}|A| \geq \epsilon|A|$ that belongs entirely to
one of the groundtruth domains. Denote this domain by $\mathcal{D}$. If $B$ does
not contain at least $(1 - \epsilon)|B|$ vertices from $\mathcal{D}$ then it
contains at least $\epsilon|B|$ vertices from other domains. On the other hand,
if $B$ contains at least $(1 - \epsilon)|B|$ vertices from $\mathcal{D}$ then $A$
contains at least $\epsilon|A|$ vertices from other domains (since $A \cup B$ is
not $(1 - \epsilon)$-pure). In both scenarios we conclude that there exist two
sets $X_1 \subseteq A$ and $X_2 \subseteq B$ such that
no groundtruth domain intersects both of them. From the property of A and B 
we know that either the number of directed edges going from Xi to X 2 in T‘^ is 
at most |Xi| • c|i?| or the number of directed edges going from X 2 to Xi in T‘^ is 
at most |Xi| -cliJl. Thus the number of those edges is at most ||Xi||X 2 |. We can 
conclude that an event £ is contained in the following event J-: there exists a pair 
of sets: Xi,X 2 , such that: |Xi| > ec^“^[pj, \X 2 \ > either the 

number of directed edges in T going from Xi to X 2 is at most ^\Xi\\X 2 \+ M or 
the number of directed edges in T going from X 2 to Xi is at most ^\Xi \\X 2 \ + M. 
We have: < P(J^). Let us calculate now the probability of F. Lets first fix Xi 
and X 2 and the direction where most of the directed edges between Xi and X 2 
go. This can be done in at most 2” • 2” different ways. Assume, without loss of 
generality that the ’’preferable direction” is from Xi to X 2 . For a pair {xi,X 2 ) 
such that: xi G Xi and X 2 G X 2 denote by y(xi,x 2 ) ^ random variable that is zero 
if there exists a directed edge from xi to X 2 in T and is one otherwise. Denote: 
^ = J 2 (xuX 2 )&XixX 2 ^{xi,x 2 )- We know that is one with probability at 

least Pm- Thus EY > |Xi||X 2 |pm- On the other hand, from the properties of 
Xi and X 2 we know that: Y < ^|Ari||Ar 2 | + M. Therefore an inequality: Y < 
^\Xi\\X 2 \+M implies: Y - EY < -\Xi\\X 2 \{pm - f - Now, knowing 

the lower bounds on jATil and ^" 2 ,using a general Azuma’s inequality (see: 
for specific Xi and X 2 and a union bound over all pairs (Xi,X 2 ), we obtain: 
P(J-) < 2" • 2"exp(-2e2 [pj[pjc2 ^-2(p„ - £ - )"). Thus we have: 


P( J-) < 2" • 2" exp(-2e2 (1 _ A)(i _ 


M 


One can check that under our choice of parameters from the assumptions of the 
lemma we have: P(A) < Since £ C A, we also have P(£) < and that 
completes the proof. 


We need one more observation before proving Theorem 4.1. In the clustering
algorithm, when we extract a set of vertices $Z$ we need to decide to which
partial cluster this set should be added (it could also be the case that $Z$
will form a new cluster). If we know that all the sets under consideration are
pure enough then we can use this fact to make the right choice. When we consider
a partial cluster $\mathcal{P}_i$ we can find an ordering of the vertices of
$Z \cup \mathcal{P}_i$ that approximates an optimal ordering with the minimum
number of backward edges (this can be done for example with the use of the
Quicksort algorithm). If the number of backward edges under this ordering is big
enough then with high probability we can conclude that $\mathcal{P}_i$ and $Z$
were taken from different domains and so $Z$ should not be added to
$\mathcal{P}_i$. Otherwise, with very high probability they come from the same
domain and that is why we should merge $Z$ with $\mathcal{P}_i$. We make this
intuitive statement more formal below:


Lemma 6 . Let T be a preference tournament with n = |T| with ^ > 12. 
Let 0 < e < ^Pm and i5i = — 1. Let w{n) be a positive function such 

that w{n) < 2"“^. Assume furthermore that n > Taax.{ ^h-i^ j '^^ 2 h- 2 ^ 2 p^ ) o.n-d 
> ^ 2 h- 2 %p ■ Let p'^ be the probability of the following event Q: 

~ there exist two sets X, Y in V{T) that are both {l — e)-pure, |X| > c^~^[^\, 
in > such that most of the vertices of X are taken from the same 

domain that most of the vertices of Y and there exists an ordering of X UY 
with more than (l + (5i + 2e)|Xm|pu backward edges in T with one endpoint 
in X and the other in Y or 

— there exist two sets X, Y in V{T) that are both {l — e)-pure, |X| > c^~^[^\, 
m > c^~^ , such that most of the vertices of X are taken from a different 

domain that most of the vertices of Y and there exists an ordering of X UY 
with at most 3(1 + (5i + 2e)|X||y|p„ backward edges in T with one endpoint 
in X and the other in Y 


Thenp^ < 

Proof. Let us first consider two sets X and Y such that most of the vertices of X 
came from the same domain as most of the vertices of Y. Denote this domain by 
V. Order the vertices of {XUY)^!) according to the canonical ordering of V and 
add the remaining vertices of XUY to that ordered sequence in the arbitrary way. 
The number Bi of backward edges in T induced by that ordering with one end¬ 
point in X, one in Y and involving points not from V is (from (1 — e)-purity) at 
most 2e|X| |F|. Denote by i ?2 the number of backward edges in T induced by that 
ordering with one endpoint in X, one in Y and involving only points from V. By 
the similar analysis as in the proofs of the previous lemmas, we conclude (using 

that P(i ?2 > (1 + <5i)m) < e =+'* 1 ^, where p, = |X||y|p„. Thus the probability 


s 


that: B 1 +B 2 > (l-l-i5i)/r-|- is at most e ^+■*1 ^. If we now sum over all possi¬ 
ble subsets X, Y with |X| > c^~^ , |y| > c^~^ then we get the following 

upper bound: p'^ < 2^"'exp(— ^ ^ — "^—Pu)- Now let us assume 

that most of the vertices of X are from different domain than most of the vertices 
of Y (second scenario in the statement of the lemma). Fix some ordering of ver¬ 
tices and sets X and Y. If we denote by B the number of backward edges with one 
endpoint of X and one in Y under this given ordering, then, using similar anal¬ 
ysis as before, we conclude that the probability that B < 3(1 -I- (5i -I- 2e)|X| |F|p„ 
_ sl_ ' 

is at most e ^+^2 ^ ^ where: ^2 = 5 - — and p = |X| jTIf we now sum over 

all possible orderings of vertices ancf all possible choices of X and Y then we 


get the following upper bound: < 2 ^"'n! exp(— 


2+<5i 




N-i) 


Pm)' 


The probability of the event from the statement of the lemma is at most
$p_a + p_b$. Under our assumptions on the values of the parameters used in the
statement of the lemma one can check that $p_a \leq \frac{1}{2w(n)}$ and
$p_b \leq \frac{1}{2w(n)}$.

(This time we will not show the calculations in more detail since they do not
involve anything more than tedious algebra.) That completes the proof.

We are ready to prove Theorem 4.1. We will in fact prove a more general, yet
also much more technical, result from which Theorem 4.1 follows.


Theorem 5. Let $q(n), g(n)$ be positive functions. Assume that $T$ is a
preference tournament on $n$ vertices with parameters $p_u, p_m, k_u$ and such
that $het(T) > 12$. Let us assume that every domain of $T$ contains at least two
vertices and $T$ consists of $k$ domains. Let $H$ be a gadget tournament used by
the algorithm. Denote $h = |H|$. Let $\epsilon > 0$ be a precision parameter.
Assume that
(i+^M)fc„/i < e < niin(^, \pm), where: 6 m = max(2, - 4iog(g(«)) 


2h 


het(T) 

US assume that 


Let 


> 


288 h^ 


log(n) — c^^~^e^prt 


and 


> 


80h-‘ log(2) 
^4^2h-2^2 


where: c = \epm- 


Then DagClustering algorithm outputs {l — e)-pure partitioning of all but at most 
an e-fraction of all the vertices ofV{T) with probability Pgucc > (^~ (gfn) 
27 rWT))(l — 0 (p/)), where: pf is a probability that the method proposed in JT]/ does 
not output the 3-approximation of the feedback arc set problem. 


Proof. Note that during the entire execution of the algorithm, every time we
perform an operation on the tournament $T_i$ we have: $|T_i| \geq \epsilon n$.

Let M be as in Lemma Let A be the following event: tournament T has 
no more than M bad edges. Let B be the following event: there do not exist two 
sets: A and B in Ti during the entire execution of the algorithm such that: 

— \A\>c^-\^^\, 

— A U i? is not (1 — e)-pure and 

— either every vertex of A is adjacent to at most c|i3| vertices in B or every 
vertex in A is adjacent from at most c|i3| vertices in B. 

Let C be the following event: 

— there do not exist two sets X, Y in V(T) that are both (1 — e)-pure, |X| > 

|y| > such that most of the vertices of X are taken 

from the same domain that most of the vertices of Y and there exists an 
ordering oi X Li Y with more than (1 + + 2e)|Jf||y|p„ backward edges in 

T with one endpoint in X and the other in Y and 

— there do not exist two sets X, Y in V(T) that are both (1 — e)-pure, |X| > 

1^1 — most of the vertices of X are taken 

from a different domain that most of the vertices of Y and there exists an 
ordering of X LI Y with at most 3(1 + + 2e)|X||F|p„ backward edges in T 

with one endpoint in X and the other in Y 

Notice that under our choice of parameters, using lemmas: S [H and 1 ^ we 
can conclude that 

1 1 1 

q{n) g{n) 2 "'“i ’ 


P(^^ U U C^) < 
where $X^c$ stands for the complement of an event $X$. Let $\mathcal{I}$ be the
event that all of $\mathcal{A}$, $\mathcal{B}$, $\mathcal{C}$ hold. Then we have:
$P(\mathcal{I}) = 1 - P(\mathcal{A}^c \cup \mathcal{B}^c \cup \mathcal{C}^c)$, so
$P(\mathcal{I})$ is at least one minus the bound above. Assume that $\mathcal{I}$
holds. By Lemma 3 we know that every time the subprocedure Find detects a copy of
$H$, one of its edges is a bad edge. Since the total number of bad edges in $T$
is at most $M$ and every time a bad edge is detected a set of $\binom{h}{2}$
edges of $T$ (containing this edge) is removed, we conclude that the algorithm
removes at most $\binom{h}{2}M$ edges of $T$. We can also conclude that Find
returns a copy of $H$ at most $M$ times. Let us assume now that Find returns two
sets $X, Y$. By the properties of the sets output by Find (see the remark
following Lemma 2) and the fact that $\mathcal{I} \subseteq \mathcal{B}$ we
conclude that $X \cup Y$ is $(1 - \epsilon)$-pure. If we now assume inductively
that all $\mathcal{P}_i$s from the algorithm are $(1 - \epsilon)$-pure, and the
procedure proposed in [1] gives a 3-approximation of the feedback arc set
problem, then using the inclusion $\mathcal{I} \subseteq \mathcal{C}$, we
conclude that the partitioning is $(1 - \epsilon)$-pure during the entire
execution of the algorithm. The algorithm obviously terminates since whenever
Find does not detect a copy of $H$ at least one vertex of $T$ is deleted (and as
we said earlier, a copy of $H$ is found at most $M$ times). Finally notice that
the number of runs of Find in which two sets are output is constant (since the
sets that are found by Find are of linear size in $n$ and every time they are
found they are deleted from $T_i$). The procedure of [1] is run only when Find
outputs two sets, thus, according to what we have just said, this procedure is
run a constant number of times. This observation and the remark that the success
of the procedure is independent of the input it acts on complete the proof.

Theorem 4.1 follows now immediately from Theorem 5 if we notice that under
the assumptions from the statement of Theorem 4.1 we have: $\delta_M = 2$.

6.7 Proof of Theorem 4.2 

We are ready to prove Theorem 4.2. 

Proof. We have already proved Theorem 4.1 and, as we will see now, this is the
main ingredient of the proof of Theorem 4.2. Notice first that it suffices to
show that the procedure Purify outputs the set of all nonoutliers. Indeed, assume
this is the case. Out of N coming queries at least (on average) will have both
vertices from the same domain. At most an $\epsilon$-fraction of those queries
(on average) will have its first vertex in the set that was not partitioned by
the DagClustering algorithm and this will also be true for the second vertex.
Finally, by a similar analysis, at most a $2\epsilon$-fraction of the queries
with both vertices in the same domain and both partitioned by the algorithm will
have at least one of its vertices in the set of outliers of this domain. If we
now take those queries for which this is not the case then it suffices to notice
that the queries that do not correspond to backward edges in the ordering
obtained by the algorithm and do not correspond to backward edges in the
canonical ordering of domains are answered correctly. Since the Quicksort
algorithm produces a 3-approximation of the feedback arc set problem, we are
done.

All we need to do is to prove the correctness of the Purify procedure. Fix a
part $\mathcal{P}_i$ of the partitioning and let $v \in \mathcal{P}_i$ be a
vertex that is not an outlier. Let $\Delta^v$ be the set of directed triangles in
$T'|\mathcal{P}_i$ that are touching it. Let $\Delta^v_1$ be those of these
triangles that have at least one vertex in the set of outliers. Since
$\mathcal{P}_i$ is $(1 - \epsilon)$-pure, we trivially get:
$|\Delta^v_1| \leq \epsilon|\mathcal{P}_i|^2$. Let $\Delta^v_2$ be those
triangles from $\Delta^v$ that have all three vertices in the set of nonoutliers.
But every such triangle needs to have an edge that is backward under canonical
ordering of the
nonoutliers. From this we get: A V < B'P' + ■ V^, where is the set of 

backward edges under canonical orderings of nonoutliers and is the set 

of backward edges under canonical orderings of nonoutliers with one endpoint 
in V. Thus we get: A"" < A\ + A 2 < el'Pip + • Vi- Now let v be 

from the set of outliers. Let TZ denote the set of nonoutliers and let TZi be its 
subset of first vertices under canonical ordering and let TZ 2 be its subset 
of last ^ vertices under canonical ordering. Notice that 7ii,7i2 > ^^\Vi\. It 
is also easy to see that the number of directed triangles touching v is at least: 
/i" > Thus, if both: |7^l| > and |7^2| > 

we have: A" > ~ it suffices to use Chernoff’s inequality 

and the union bound, as we have done so far many times, to see that for the 
choice of e from the statement of the theorem with probability 1 — o(l) we have 
both: 

— A^ > ^^ 32 ^ every outlier v, and 

— A" < 32 foi' every nonoutlier 

as long as $|\mathcal{P}_i|$ is large enough. We leave the details to the reader
this time. Since $|\mathcal{P}_i|$ is linear in $n$ and, to approximate the
number of directed triangles touching $v$ well enough with probability
$1 - o(1)$, it suffices to select $O(\log(n))$ random samples, we are done.

6.8 Time complexity of the algorithms 

Let us analyze the time complexity of the DagClustering algorithm first. One run
of the algorithm presented in [1] requires $O(n \log n)$ time on average (and this
running time is highly concentrated around its mean), but in theory, to be sure
with probability $1 - o(1)$ that the ranking that is found is a 3-approximation,
we need to perform it more than once. It suffices to perform it $\log n$ times (in
practice it is not necessary to run it more than a few times, and this is what we
did in our experiments). We then output the ordering that gives the smallest
number of backward edges. This check requires $O(n^2)$ time. The multiple
run of the routine from [1] is what we call the Quicksort subroutine in the
algorithmic section of the main body of the paper. We have already noticed that the
subroutine Quicksort is called a constant number of times in the DagClustering
algorithm. We have also already observed that Find outputs two sets, $X$ and $Y$, a
constant number of times. Assume that the event introduced in the proof of the
corresponding theorem holds. We have also observed that Find detects a copy of $H$
at most $M$ times (see the proof of that theorem for the definition of $M$). Now,
notice that a straightforward implementation of Find requires $O(n^2)$ time. So,
conditioned on this event, the running time is $O(n^2 M)$ with high probability.
$M$ is usually much smaller than $n$, thus the running time is slightly
superquadratic. In practice it is even close to linear
due to several small heuristics that were used to speed up the entire algorithm
(see the discussion in the experimental section). To see that the running time of the
HeteroRanking algorithm is also slightly superquadratic, it suffices to observe
that the straightforward implementation of the Purify procedure takes $O(n^2)$
time.
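
The repeated-run scheme described above (run a randomized pivot-based ordering several times and keep the ordering with the fewest backward edges) can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the adjacency-matrix representation and the names `quicksort_order`, `count_backward_edges` and `best_of_runs` are assumptions introduced here.

```python
import math
import random

def quicksort_order(adj, vertices, rng):
    """Order vertices by recursively splitting around a random pivot."""
    if len(vertices) <= 1:
        return list(vertices)
    pivot = rng.choice(vertices)
    before = [u for u in vertices if u != pivot and adj[u][pivot]]
    after = [u for u in vertices if u != pivot and adj[pivot][u]]
    return (quicksort_order(adj, before, rng) + [pivot]
            + quicksort_order(adj, after, rng))

def count_backward_edges(adj, order):
    """Count edges pointing from a later to an earlier vertex; O(n^2)."""
    back = 0
    for i in range(len(order)):
        for j in range(i + 1, len(order)):
            if adj[order[j]][order[i]]:
                back += 1
    return back

def best_of_runs(adj, runs=None, seed=0):
    """Run the randomized ordering several times and keep the best result."""
    n = len(adj)
    rng = random.Random(seed)
    if runs is None:
        # About log(n) runs, as in the analysis; at least one run.
        runs = max(1, math.ceil(math.log(n))) if n > 1 else 1
    best_order, best_cost = None, None
    for _ in range(runs):
        order = quicksort_order(adj, list(range(n)), rng)
        cost = count_backward_edges(adj, order)
        if best_cost is None or cost < best_cost:
            best_order, best_cost = order, cost
    return best_order, best_cost
```

The quadratic check in `count_backward_edges` dominates a single run, which matches the $O(n^2)$ cost of comparing the candidate orderings.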


6.9 HeteroRanking versus state-of-the-art ranking methods 

In this short subsection we would like to explain a little more quantitatively
why, in the heterogeneous setting, algorithms such as Quicksort and other
methods that aim to find an ordering with a small number of backward edges cannot
succeed alone and need to act on an input that was previously preprocessed
by some digraph clustering algorithm. Let us focus on the Quicksort algorithm
first since it is very easy to implement. Let us take the very simple yet difficult
enough for the Quicksort algorithm setting of two domains: $\mathcal{D}_1$ and $\mathcal{D}_2$. Assume
that both are of the same size $n$. Assume that for every $u \in \mathcal{D}_1$, $v \in \mathcal{D}_2$ there
exists an edge $(u, v)$ in the preference tournament $T$ with probability at least $p_m$
and there exists an edge $(v, u)$ in the preference tournament $T$ with probability
at least $p_m$. Quicksort chooses a pivot point $p$ uniformly at random. By
symmetry, assume without loss of generality that $p \in \mathcal{D}_2$. Let $\mathcal{D}_1^1$ be the set of the
first $\frac{n}{2}$ vertices of $\mathcal{D}_1$ under its canonical ordering and let $\mathcal{D}_1^2$ be the set of the last
$\frac{n}{2}$ vertices of $\mathcal{D}_1$ under its canonical ordering. Let $N_1$ be the set of outneighbors
of $p$ in $\mathcal{D}_1^1$ and let $N_2$ be the set of inneighbors of $p$ in $\mathcal{D}_1^2$. Notice that under
the first reordering of vertices in the Quicksort algorithm all the points from
$N_2$ will be ordered before all the points from $N_1$. Since every point from $N_1$ is
adjacent to every point from $N_2$, after the first reordering of vertices we will produce
at least $|N_1||N_2|$ backward edges. Then obviously the number of backward edges
in the final ranking will be at least $|N_1||N_2|$ (in fact it is easy to prove that it will
increase even more, but let us take the simple bound we obtained from the first
iteration). Notice that the expected size of $N_1$ is $\frac{n}{2} p_m$ and this is also true for $N_2$.
Thus the average number of backward edges in the ranking output by the Quicksort
algorithm is at least $\frac{n^2}{4} p_m^2$. What is even more important, the backward edges we
were talking about so far had the property that both their endpoints were taken from the
same domain. Therefore it is easy to see that the obtained ranking is of very bad
quality and will incorrectly answer a significant fraction of all incoming queries. In
particular, there is no chance to obtain recall close to one. However, as we have
already shown, a purely combinatorial digraph clustering mechanism combined
with the Quicksort algorithm makes this achievable. The problem we raised above
is not specific to the Quicksort method. One can easily prove that for a
tournament with $k$ domains of size $n$ each, where the directions of edges between
different domains are chosen independently at random, the number of backward
edges under every ordering is quadratic with probability close to 1. The presented
digraph clustering mechanism is crucial for filtering out low-quality information
and detecting regions of much lower entropy that correspond to much denser
regions of the underlying probability distribution of the majority-voting model.
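
The effect of the first pivot step can be checked with a small simulation. The sketch below rests on assumptions made only for illustration: the first domain has $n$ vertices indexed in canonical order, its first and last halves play the roles of the two subsets, the pivot comes from the other domain, and each edge between the pivot and a vertex points away from the pivot independently with probability `p_cross`. The function name is hypothetical.

```python
import random

def first_pivot_backward_edges(n, p_cross, seed=0):
    """Simulate one Quicksort pivot step with the pivot in the other domain.

    Returns (|N1|, |N2|, number of backward edges created inside the domain).
    """
    rng = random.Random(seed)
    half = n // 2
    # pivot_beats[u] is True when the edge goes from the pivot to u.
    pivot_beats = [rng.random() < p_cross for _ in range(n)]
    # N1: outneighbors of the pivot among the first half of the domain.
    n1 = [u for u in range(half) if pivot_beats[u]]
    # N2: inneighbors of the pivot among the last half of the domain.
    n2 = [u for u in range(half, n) if not pivot_beats[u]]
    # Quicksort puts N2 before the pivot and N1 after it, so every canonical
    # edge a -> b with a in N1 and b in N2 becomes a backward edge.
    backward = sum(1 for a in n1 for b in n2 if a < b)
    return len(n1), len(n2), backward
```

With `p_cross = 0.5` both sets have about a quarter of the domain each, so a constant fraction of all within-domain pairs is already misordered after one step.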


6.10 Weighted setting

Notice that in this setting we consider not just digraphs but tournaments. Besides,
we assume that preference tournaments are unweighted. This, however, does not
narrow the generality of our analysis at all. All the results we obtained transfer
naturally to the weighted digraph setting. In fact, considering a random model
of the preference tournament enables us to accurately mimic the weighted digraph
setting, even without making any transformation. The lack of some
edges in the general digraph setting was introduced to emulate the scenario
where there are no statistics regarding some pair of objects or those statistics
are very poor. This is straightforwardly simulated in our model by edges between
points from different domains. Both possible directions of those edges have
significant probabilities of being chosen. In particular, if both are equal to $\frac{1}{2}$ then
the expected "signed weight" of the corresponding pair of points is 0, which
means that an edge is absent. In the general digraph model the weights were
introduced to emulate the fact that some statistics are more important or the
users are more confident about preferences between some objects than others.
All the weights were taken from the interval [0, 1]. But of course weights from
that interval become probabilities in our model. Thus we do not lose anything
by considering unweighted preference tournaments.
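
The correspondence between weights and probabilities can be made concrete with a short sketch. It assumes, for illustration only, that a direction $(u, v)$ contributes $+1$ and the opposite direction $-1$ to the signed weight, and that `weights[u][v]` for `u < v` holds the probability of the edge $(u, v)$; both function names are hypothetical.

```python
import random

def expected_signed_weight(p):
    """Expected signed weight of a pair whose edge (u, v) has probability p."""
    return p * 1.0 + (1.0 - p) * (-1.0)

def sample_tournament(weights, seed=0):
    """Sample an unweighted tournament from pairwise probabilities.

    weights[u][v] for u < v is the probability that the edge goes u -> v.
    """
    rng = random.Random(seed)
    n = len(weights)
    adj = [[False] * n for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < weights[u][v]:
                adj[u][v] = True
            else:
                adj[v][u] = True
    return adj
```

For `p = 0.5` the expected signed weight is 0, which is the case the text interprets as an absent edge.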


6.11 Experiments 

We conducted several experiments to test the ranking mechanism of the
HeteroRanking algorithm as well as the quality of the clustering produced by the
DagClustering procedure. We also compared our results with those obtained by
state-of-the-art techniques.


Table 1. Table comparing the best ranking of the four constructed by: [1], [8], [23] and
[24] with the HeteroRanking algorithm. Tests were conducted for C = 15, depth = 12
and V = 100 votes for every pair of objects within a groundtruth cluster. The number of
objects is given in 10^3 units and the ratio in 10^-3 units.


n [in 10^3]       1     2     2.5   3     3.5   4     4.5   5     5.5   6     6.5
k                 2     2     2     2     3     3     3     3     4     4     4
ratio [in 10^-3]  2     4     6     8     12    14    16    18    19    20    22
p_succ            0.55  0.55  0.55  0.55  0.55  0.55  0.6   0.6   0.6   0.6   0.6
e_bestoffour      0.33  0.27  0.28  0.30  0.24  0.33  0.35  0.32  0.34  0.35  0.34
e_clust           0.09  0.12  0.16  0.14  0.18  0.17  0.13  0.15  0.17  0.2   0.22


Table 1 compares the quality of the ranking produced by the HeteroRanking
algorithm with the best one from the following four: [1], [8], [23] and
[24]. The results cover different numbers of domains and different quality characteristics
of the statistics published according to the majority-voting mechanism. We use
the following notation: n - number of all the objects, k - number of groundtruth
clusters, ratio - the ratio between the number of votes for pairs of objects from
different clusters and from the same cluster, p_succ - the probability that a voter
will correctly classify a given pair of objects, e_bestoffour - generalization error of
the best of four state-of-the-art approaches, e_clust - generalization error of the
HeteroRanking algorithm.


Table 2. Table comparing the best ranking of the four constructed by: [1], [8], [23] and
[24] with the HeteroRanking algorithm. This time we also change the depth parameter.
Tests were conducted for C = 15, V = 100 and n = 6500.


k                 2     2     2     2     3     3     3     3     4     4     4
ratio [in 10^-3]  2     4     8     10    12    14    16    20    22    23    25
p_succ            0.55  0.55  0.55  0.55  0.55  0.55  0.55  0.6   0.6   0.6   0.6
depth             8     9     10    11    12    20    25    30    40    50    55
e_bestoffour      0.35  0.33  0.35  0.32  0.31  0.34  0.34  0.28  0.24  0.23  0.23
e_clust           0.35  0.36  0.30  0.20  0.14  0.12  0.12  0.16  0.18  0.20  0.22


Unless explicitly stated otherwise, the vertices are split uniformly between
the clusters. It was tested experimentally that adding a simple heuristic to
the Searcher subprocedure of the Find algorithm can significantly improve the
running time without affecting accuracy. In the Searcher we first randomly permute
the vertices of the forbidden pattern, establishing a random order in which we
will look for them. To estimate whether the set of out/inneighbors of the given
vertex $h_i$ is large enough we perform simple sampling. Then, if we have already
found $C$ copies of $H$ ($C$ is a parameter), and in the current run of Find we have
found more than $d$ vertices of the potential embedding (we will call $d$ the depth
parameter), we rerun Find. As the gadget $H$ we use a random tournament on 60
vertices, since it was experimentally verified that this order of the tournament is
good enough to obtain a high-quality ranking.
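
The sampling test mentioned above (deciding whether a vertex has enough out- or inneighbors without scanning all candidates) might look like the sketch below. The function name, the `min_fraction` threshold and the number of samples are hypothetical parameters chosen for illustration; the paper does not specify them.

```python
import random

def neighborhood_large_enough(adj, v, candidates, min_fraction,
                              num_samples, rng, out=True):
    """Estimate by sampling whether v has many out- (or in-) neighbors.

    Draws `num_samples` candidates with replacement and compares the
    observed fraction of neighbors with `min_fraction`.
    """
    if not candidates:
        return False
    hits = 0
    for _ in range(num_samples):
        u = rng.choice(candidates)
        if u == v:
            continue  # a vertex is not its own neighbor
        is_neighbor = adj[v][u] if out else adj[u][v]
        if is_neighbor:
            hits += 1
    return hits / num_samples >= min_fraction
```

Each call costs time proportional to the number of samples rather than to the number of candidates, which is the source of the speedup in this kind of heuristic.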


Table 3. Table comparing the best ranking of the four constructed by: [1], [8], [23]
and [24] with the HeteroRanking algorithm. Tests were conducted for C = 15, k = 4
and V = 100. This time vertices are not uniformly split across the clusters and the
sizes of the clusters are given as parameters: n_1, n_2, n_3 and n_4.


n_1 [in 10^3]     0.5   0.5   0.5   0.5   0.6   0.6   0.6   0.6   0.6   0.7   0.7
n_2 [in 10^3]     1     1     1     1     1     1     2     2     2     2     2
n_3 [in 10^3]     2     2     2     2     2     2     2.5   2.5   2.5   3     3
n_4 [in 10^3]     1.5   1.5   1.7   1.7   1.8   1.8   1.8   1.9   1.9   2     2
ratio [in 10^-3]  2     4     8     10    12    14    16    20    22    23    25
p_succ            0.55  0.55  0.55  0.55  0.55  0.55  0.55  0.6   0.6   0.6   0.6
depth             8     9     10    11    12    20    25    30    40    50    55
e_bestoffour      0.35  0.31  0.29  0.28  0.19  0.18  0.17  0.19  0.23  0.22  0.19
e_clust           0.37  0.35  0.20  0.1   0.12  0.13  0.13  0.20  0.25  0.20  0.21


Table 2 compares our results with the same methods as Table 1 from the
main body of the paper, but for different values of the parameter depth. Additional
results comparing our approach with existing methods are presented
in Table 3. The setting is similar to that of Tables 1 and 2, but this time the domains
are of different sizes.

Figure 1 and Figure 2 show how the generalization error depends on the quality
characteristic of the statistics obtained from the majority-voting mechanism.
The HeteroRanking algorithm outperforms state-of-the-art methods for statistics of
lower quality (smaller ratio values).



Fig. 1. Diagrams comparing the HeteroRanking method with [23], [24] and the QuickSort
algorithm from [1]. Tests were performed for C = 15, n = 7000, V = 100 and
depth = 15. The number of clusters is: (a) k = 3, (b) k = 4.


We also performed experiments testing how many times in practice we need
to run the Find procedure. It turns out that the theoretical bounds we gave were
very pessimistic and in fact the number of iterations is much smaller. This implies
a much better running time. We checked experimentally that the much smaller than
assumed number of iterations comes from the fact that in practice the sets $X$, $Y$
in the Find procedure are detected much earlier and they are also much larger
(the results of the experiments are presented in Figure 3). Thus when a piece
of the domain is found in the HeteroRanking algorithm, it is very large
on average. That in turn implies a much faster reconstruction of the domain (up
to the precision parameter $\epsilon$). We plan to investigate this phenomenon more
closely from the theoretical point of view in subsequent papers on
the topic. We should also note that, as we verified experimentally, the
Purify subroutine does not necessarily need to be used to obtain a good-quality
ranking. Since the partitioning computed at earlier stages of the algorithm is
very pure, the outliers are chosen as pivot points with very low probability and
do not affect the overall quality of the ranking mechanism.


Fig. 2. Diagrams comparing the HeteroRanking method with [8], [23], [24] and the QuickSort
algorithm from [1]. Tests were performed for C = 15, n = 7000, V = 100 and
depth = 15. The number of clusters is: (c) k = 5, (d) k = 6.



[Plot for Fig. 3; axis label: number of iterations]


Fig. 3. Diagram presenting how the purity of the computed clusters depends on the
number of runs of the Find procedure, which is the most expensive part of the
HeteroRanking algorithm. The purity is defined as the fraction of the groundtruth domain
that was already reconstructed. The tests were performed for different sizes of the set
of objects: n = 3000, 3500, 4000, 4500, 5000, for C = 12, depth = 14, ratio = 0.1 and
V = 200.

